# An age of 0 stands for a missing value, so recode it as NA
data$vict_age[data$vict_age == 0] <- NA
# Age distribution
ggplot(data, aes(x = vict_age)) +
  geom_histogram(binwidth = 5, fill = "#433E85FF", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Age Distribution of Victims", x = "Age", y = "Frequency")

The first figure is a histogram of victim age in 5-year bins. The x-axis
shows age, ranging from 0 to 120, and the y-axis shows frequency, with
the tallest bins reaching 100,000 or more. The distribution is roughly
bell-shaped: victims are concentrated in the middle age ranges (around
30-50), with fewer at younger (0-20) and older (70+) ages. It is not
fully symmetric, however; it is slightly skewed, with more of the mass
at younger ages.
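
The visual impression of skew can also be checked numerically. The sketch below uses a small set of hypothetical ages (not the report's data) and a hand-written `skewness()` helper, which is an assumption introduced here for illustration. A mean above the median together with a positive skewness value points to a longer tail on the older side.

```r
# Hypothetical toy ages (not the report's data), chosen to have a long right tail
ages <- c(19, 22, 24, 25, 27, 29, 30, 31, 33, 35, 38, 42, 47, 55, 68, 84)

# Moment-based skewness: mean cubed deviation divided by the
# cubed (population) standard deviation
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

age_mean <- mean(ages)
age_median <- median(ages)
age_skew <- skewness(ages)

# A mean above the median and a positive skewness indicate a right tail
cat("mean:", age_mean, "median:", age_median, "skewness:", round(age_skew, 2), "\n")
```

On the real data the same check would be `skewness(data$vict_age)`, since the helper drops `NA` values before computing.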
# Divide age into four categories
data$age_group <- cut(
  data$vict_age,
  breaks = c(-Inf, 18, 40, 60, Inf),
  labels = c("Juvenile", "Young adult", "Middle-aged people", "The elderly"),
  right = FALSE
)
ggplot(data[!is.na(data$vict_age), ], aes(x = age_group)) +
  geom_bar(fill = "#25858EFF", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(
    title = "Age Group Distribution of Victims",
    x = "Age Group",
    y = "Count"
  )

The second figure is a bar plot showing the count of victims across
predefined age groups. The x-axis categorizes victims into four groups:
Juvenile (under 18), Young adult (18-39), Middle-aged people (40-59) and
The elderly (60 and over). The y-axis represents the count of victims in
each age group, with values ranging from 0 to over 400,000. The Young
adult group has the highest count, followed by Middle-aged people and
The elderly. The Juvenile group has the lowest count. Victims are
concentrated in the young adult and middle-aged categories.
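
Because the grouping uses `right = FALSE`, each interval includes its lower bound and excludes its upper bound. The short sketch below applies the same breaks to a few hypothetical ages that sit on or next to each boundary, to show which group they land in.

```r
# Hypothetical ages on or next to each break point, plus a missing value
ages <- c(17, 18, 39, 40, 59, 60, NA)

age_group <- cut(
  ages,
  breaks = c(-Inf, 18, 40, 60, Inf),
  labels = c("Juvenile", "Young adult", "Middle-aged people", "The elderly"),
  right = FALSE  # intervals are [lower, upper): 18 is a young adult, not a juvenile
)

# Missing ages stay NA rather than being assigned to a group
data.frame(age = ages, age_group = age_group)
```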
# Create a boxplot showing age distribution by area
box_age_area <- ggplot(data, aes(x = area_name, y = vict_age, fill = area_name)) +
  geom_boxplot(outlier.color = "black", outlier.size = 0.5, alpha = 0.7) +
  theme_minimal() +
  labs(
    title = "Age Distribution by Area",
    x = "Area",
    y = "Victim Age",
    fill = "Area"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )
# Display the plot
ggplotly(box_age_area)
This box plot shows how victim age varies across areas, in terms of both
central tendency (medians) and variability (IQRs and ranges). Each box
represents the distribution of victim ages in one area. The box spans
the interquartile range (IQR), that is, the middle 50% of the data, and
the horizontal line inside it marks the median age. The "whiskers"
extend to the most extreme values that are not classed as outliers. Dots
beyond the whiskers are outliers: ages that fall well outside the
typical range for that area. Comparing boxes across areas shows which
areas have younger or older median victim ages, and areas with taller
boxes or longer whiskers have more variability in victim ages.
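
To make the outlier rule concrete, the sketch below reproduces it by hand for one hypothetical area, assuming the default 1.5 x IQR rule that `geom_boxplot()` uses when `coef` is not changed. The ages are invented for illustration.

```r
# Hypothetical victim ages for a single area (not the report's data)
ages <- c(20, 25, 30, 35, 40, 95)

q1 <- unname(quantile(ages, 0.25))
q3 <- unname(quantile(ages, 0.75))
iqr <- q3 - q1

# Points beyond these fences are drawn as outlier dots
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- ages[ages < lower_fence | ages > upper_fence]

# The upper whisker ends at the largest observation inside the fence
whisker_top <- max(ages[ages <= upper_fence])

cat("Q1:", q1, "Q3:", q3, "outliers:", outliers, "upper whisker:", whisker_top, "\n")
```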
# Crime severity distribution by age group
data$severity_label <- ifelse(data$part_1_2 == 1, "Serious", "Less Serious")
# Significance test: is crime severity associated with age group?
# Convert severity to a factor
data$severity_label <- as.factor(data$severity_label)
# Chi-squared test of independence
severity_age_table <- table(data$age_group, data$severity_label)
chisq_test <- chisq.test(severity_age_table)
print(chisq_test)
##
## Pearson's Chi-squared test
##
## data: severity_age_table
## X-squared = 4762.7, df = 3, p-value < 2.2e-16
# Report the result at the 5% significance level
if (chisq_test$p.value < 0.05) {
  print("Age group has a statistically significant relationship with crime severity.")
} else {
  print("No significant relationship between age group and crime severity.")
}
## [1] "Age group has a statistically significant relationship with crime severity."
ggplot(data[!is.na(data$vict_age), ], aes(x = age_group, fill = severity_label)) +
  geom_bar(position = "fill", alpha = 0.7) +
  theme_minimal() +
  labs(
    title = "Crime Severity Distribution by Age Group",
    x = "Age Group",
    y = "Proportion",
    fill = "Crime Severity"
  )

The figure shows the proportion of serious and less serious crimes
within each age group. For juveniles, the proportion of serious crimes
is relatively low (about 30%), with the remaining 70% being less serious
crimes. For young adults, the proportion of serious crimes rises to
around 55%, and the remainder are less serious crimes. For middle-aged
people and the elderly, the pattern is similar to that of young adults,
with slightly fewer serious crimes.
This visualization shows how crime severity is distributed across age
groups. Apart from juveniles, the proportion of serious crimes tends to
decrease with age. Crimes against juveniles are serious less often than
crimes against any older age group.
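
With a sample this large, even a weak association yields a tiny p-value, so an effect-size measure helps judge how strong the relationship is. The sketch below computes Cramér's V and the standardized residuals on a hypothetical 4 x 2 table. The counts are invented and are not the report's `severity_age_table`.

```r
# Hypothetical counts (rows: age groups, columns: severity); not the report's table
toy_table <- matrix(
  c(30, 70,
    55, 45,
    52, 48,
    50, 50),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    age_group = c("Juvenile", "Young adult", "Middle-aged people", "The elderly"),
    severity = c("Serious", "Less Serious")
  )
)

toy_test <- chisq.test(toy_table)

# Cramér's V = sqrt(chi-squared / (n * (min(rows, cols) - 1))), between 0 and 1
n <- sum(toy_table)
k <- min(dim(toy_table)) - 1
cramers_v <- sqrt(unname(toy_test$statistic) / (n * k))

# Standardized residuals show which cells depart most from independence
print(round(toy_test$stdres, 2))
print(cramers_v)
```

Applied to the report's data, the same formula would take `chisq_test$statistic` and `sum(severity_age_table)` as inputs.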
# Crime severity by victim gender
# Filter out rows where vict_sex is "-" or NA
clean_data <- data[!is.na(data$vict_sex) & data$vict_sex != "-", ]
# Recode gender codes with clearer labels
clean_data <- clean_data %>%
  mutate(gender_label = recode(vict_sex,
    "F" = "Female",
    "M" = "Male",
    "H" = "Intersex/Other",
    "X" = "Unknown"
  ))
# Calculate the proportion of crime severity for each gender
severity_gender_data <- clean_data %>%
  group_by(severity_label, gender_label) %>%
  summarise(count = n(), .groups = "drop") %>%
  complete(severity_label, gender_label, fill = list(count = 0)) %>%
  group_by(gender_label) %>%
  mutate(percentage = count / sum(count) * 100)
severity_gender_table <- severity_gender_data %>%
  arrange(gender_label, desc(percentage))
# Display the table
kable(severity_gender_table,
  format = "html",
  caption = "Crime Severity by Gender and Percentage"
) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| severity_label | gender_label | count | percentage |
|---|---|---|---|
| Less Serious | Female | 197898 | 56.10149 |
| Serious | Female | 154852 | 43.89851 |
| Serious | Intersex/Other | 70 | 62.50000 |
| Less Serious | Intersex/Other | 42 | 37.50000 |
| Serious | Male | 237003 | 59.73325 |
| Less Serious | Male | 159766 | 40.26675 |
| Serious | Unknown | 57399 | 60.70050 |
| Less Serious | Unknown | 37162 | 39.29950 |
# Plot the pie chart with values annotated
ggplot(severity_gender_data, aes(x = "", y = percentage, fill = severity_label)) +
  geom_bar(stat = "identity", width = 1, alpha = 0.7) +
  coord_polar(theta = "y") +
  facet_wrap(~ gender_label) + # Use the recoded gender labels
  geom_text(aes(label = paste0(round(percentage, 1), "%")),
    position = position_stack(vjust = 0.5), size = 4
  ) + # Add percentage labels
  theme_minimal() +
  labs(
    title = "Crime Severity Distribution by Gender",
    x = NULL,
    y = NULL,
    fill = "Crime Severity"
  ) +
  theme(
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank()
  )
The chart compares the split between serious and less serious crimes
across victim gender categories; each pie shows the proportions for one
category. Intersex/Other has the highest proportion of Serious crimes
(62.5%), followed by the Unknown category (60.7%) and Males (59.7%).
Females are the only category in which Less Serious crimes form the
majority (56.1%). Male victims and the Intersex/Other and Unknown
categories are therefore associated with a greater proportion of serious
crimes, while female victims are more often associated with less serious
crimes. The Intersex/Other figure rests on only 112 records, so it is
much less precise than the others.
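
The percentages in the table can be reproduced directly from its counts. The sketch below does this with base R's `prop.table()`, then computes an exact binomial interval for the small Intersex/Other group as a rough indication of how uncertain that share is. The choice of `binom.test()` is an illustration added here, not a step from the original analysis.

```r
# Counts copied from the table above (rows: gender, columns: severity)
counts <- matrix(
  c(154852, 197898,
        70,     42,
    237003, 159766,
     57399,  37162),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    gender = c("Female", "Intersex/Other", "Male", "Unknown"),
    severity = c("Serious", "Less Serious")
  )
)

# Row-wise percentages: each gender's row sums to 100
pct <- prop.table(counts, margin = 1) * 100
print(round(pct, 1))

# Exact binomial 95% interval for the serious share among the
# 112 Intersex/Other records (70 serious out of 112)
small_ci <- binom.test(70, 112)$conf.int
print(small_ci)
```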
# Filter data to remove invalid or missing race entries
clean_data <- data %>%
  filter(!is.na(vict_descent) & vict_descent != "-")
# Map race codes to full descriptions; any other code becomes "Others"
clean_data <- clean_data %>%
  mutate(
    vict_descent = recode(vict_descent,
      "B" = "Black",
      "H" = "Hispanic",
      "W" = "White",
      "X" = "Unknown",
      "O" = "Others",
      .default = "Others"
    )
  )
# Calculate the proportion of each race
race_distribution <- clean_data %>%
  group_by(vict_descent) %>%
  summarise(count = n(), .groups = "drop") %>%
  mutate(percentage = count / sum(count) * 100) %>%
  # Merge small groups (<5%) into "Others"
  mutate(vict_descent = ifelse(percentage < 5 | vict_descent == "Others",
    "Others", vict_descent
  )) %>%
  # Recalculate totals for the merged groups
  group_by(vict_descent) %>%
  summarise(count = sum(count), percentage = sum(percentage), .groups = "drop")
# Create the pie chart
ggplot(race_distribution, aes(x = "", y = percentage, fill = vict_descent)) +
  geom_bar(stat = "identity", width = 1, alpha = 0.7) +
  coord_polar(theta = "y") +
  theme_minimal() +
  labs(
    title = "Racial Distribution of Victims",
    x = NULL,
    y = NULL,
    fill = "Race"
  ) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")),
    position = position_stack(vjust = 0.5), size = 4
  ) +
  theme(
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank()
  )

This pie chart shows the racial composition of the victim population, that is, the proportion of victims in each racial category. Hispanic victims form the largest group, comprising over a third of the total. White and Black victims are the next most represented groups, with White victims clearly outnumbering Black victims. The Unknown and Others categories form smaller, but still notable, portions of the distribution.
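
The merging of small categories into "Others" can be illustrated on a few hypothetical counts. The sketch below uses base R only and the same 5% threshold as the code above; the category names and numbers are invented for illustration.

```r
# Hypothetical category counts (not the report's data)
counts <- c(Hispanic = 400, White = 250, Black = 200,
            Asian = 40, Korean = 10, Others = 100)

pct <- counts / sum(counts) * 100

# Relabel every category under the 5% threshold, then re-aggregate by label
label <- ifelse(pct < 5 | names(pct) == "Others", "Others", names(pct))
lumped <- tapply(pct, label, sum)

# The merged percentages still add up to 100
print(lumped)
```

Merging after the percentages are computed, as the pipeline above does, keeps the total at 100% while limiting the number of thin slices in the pie.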
# Crime severity by age group and gender
ggplot(
  data[!is.na(data$vict_age) & !is.na(data$vict_sex) & data$vict_sex != "H", ],
  aes(x = age_group, fill = severity_label)
) +
  geom_bar(position = "dodge", alpha = 0.7) +
  facet_wrap(~ vict_sex) +
  theme_minimal() +
  labs(
    title = "Crime Severity Distribution by Age Group and Gender",
    x = "Age Group",
    y = "Count",
    fill = "Crime Severity"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

This grouped bar chart shows how victim counts by crime severity differ across gender and age groups.
For Females (F):
Young adults have the highest counts, with less serious crimes slightly more frequent than serious crimes.
Middle-aged people also show substantial counts, though lower than young adults, with a similar split between severities.
Counts for juveniles and the elderly are much lower.
For Males (M):
Young adults have the highest counts, with serious crimes more frequent than less serious ones (the reverse of the pattern for females).
Middle-aged people follow with high counts, though lower than young adults, and serious crimes still dominate.
Counts for juveniles and the elderly are low, as for females.
For Unknown (X):
Counts are minimal across all age groups, though slightly higher for young adults and middle-aged people than for juveniles and the elderly.
Overall, male victims are associated with more serious crimes than female victims, especially among young adults and middle-aged people. Victims of unknown gender (X) have relatively low counts across all groups. Young adults and middle-aged people account for most records, while juveniles and the elderly have lower counts across all genders.
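
The comparisons above mix two questions: how many records fall in each cell, and how severity is split within a cell. A three-way contingency table separates them. The sketch below builds one from a handful of hypothetical records using base R's `xtabs()`, `ftable()` and `prop.table()`; the data frame `toy` is invented for illustration.

```r
# Hypothetical records (not the report's data)
toy <- data.frame(
  sex = c("F", "F", "F", "F", "M", "M", "M", "M", "M", "M"),
  age_group = c("Young adult", "Young adult", "The elderly", "Young adult",
                "Young adult", "Young adult", "Young adult", "The elderly",
                "The elderly", "Young adult"),
  severity = c("Serious", "Less Serious", "Less Serious", "Less Serious",
               "Serious", "Serious", "Less Serious", "Serious",
               "Less Serious", "Serious")
)

# Three-way table of counts, flattened for printing
counts <- xtabs(~ sex + age_group + severity, data = toy)
print(ftable(counts))

# Share of each severity level within every sex-by-age-group cell
shares <- prop.table(counts, margin = c(1, 2))
print(round(ftable(shares), 2))
```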
# Crime severity by age group and race
ggplot(
  data[!is.na(data$vict_age) & !is.na(data$vict_descent) & data$vict_descent != "-", ],
  aes(x = age_group, fill = severity_label)
) +
  geom_bar(position = "fill", alpha = 0.7) +
  facet_wrap(~ vict_descent) +
  theme_minimal() +
  labs(
    title = "Crime Severity by Age Group and Race",
    x = "Age Group",
    y = "Proportion",
    fill = "Crime Severity"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

This faceted stacked bar chart shows how crime severity varies across age groups within each racial category; the facets are labeled with the raw descent codes. Because every bar is scaled to 1, the chart shows the split between serious and less serious crimes within an age group, not the number of victims in it. Young adults and middle-aged people generally show the larger proportions of serious crimes across most racial categories, while juveniles and the elderly consistently show smaller ones. The proportion of serious crimes varies across both racial groups and age groups. In some racial categories (e.g., D, S, V), serious crimes are slightly more common for young adults and middle-aged people. In other racial categories (e.g., A, K, X), less serious crimes (purple) dominate across all age groups.
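
Proportions in a facet with few records can swing widely, so it may help to check group sizes before reading too much into a panel. The sketch below flags sparse groups in a hypothetical vector of descent codes. Both the data and the cut-off of 30 records are assumptions made for illustration.

```r
# Hypothetical descent codes with very unequal group sizes (not the report's data)
descent <- c(rep("H", 500), rep("W", 300), rep("B", 150), rep("K", 8), rep("S", 2))

group_sizes <- table(descent)

# Hypothetical cut-off: groups with fewer than 30 records are flagged as sparse
min_n <- 30
sparse_groups <- names(group_sizes)[group_sizes < min_n]

print(group_sizes)
print(sparse_groups)
```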